Abstract

Bird sound data collected with unattended 
microphones for automatic surveys, or mo- 
bile devices for citizen science, typically con- 
tain multiple simultaneously vocalizing birds 
of different species. However, few works have 
considered the multi-label structure in bird- 
song. We propose to use an ensemble of clas- 
sifier chains combined with a histogram-of- 
segments representation for multi-label clas- 
sification of birdsong. The proposed method 
is compared with binary relevance and three 
multi-instance multi-label learning (MIML) 
algorithms from prior work (which focus 
more on structure in the sound, and less on 
structure in the label sets). Experiments 
are conducted on two real-world birdsong 
datasets, and show that the proposed method 
usually outperforms binary relevance (using 
the same features and base-classifier), and is 
better in some cases and worse in others com- 
pared to the MIML algorithms. 



1. Introduction 

The most familiar formulation of supervised classification
associates single feature vectors with single labels,
hence it is called single-instance single-label (SISL). 
For example, SVM and logistic regression are SISL 
classifiers. One common setup involving SISL clas- 
sifiers is to use a segmentation algorithm to extract 
"syllables" or calls of bird sound from a recording, 
each of which is described by a feature vector. A SISL 
classifier is trained on a collection of syllables paired 
with species labels, then predicts the species for a new 
syllable (Fagerlund, 2007; Damoulas et al., 2010).

Many of the audio recordings used in SISL experiments 
are collected with a directional microphone aimed by
a person at the bird of interest. This method produces 
recordings where the targeted bird is louder than other 
sound sources in the environment. Audio data col- 
lected by unattended microphones for the purpose of 
acoustic monitoring, and audio collected with mobile 
devices for citizen science are less ideal; it is common to 
have multiple simultaneously vocalizing bird species, 
in addition to other sources of noise such as non-bird 
species, wind, rain, streams, and motor vehicles. Few 
works have addressed these complexities in real-world 
data (Brandes, 2008; Briggs et al., 2012c). 

There are two kinds of structure in bird sound data 
that can be exploited through alternative frameworks 
for supervised classification. First, bird sound is natu- 
rally decomposed into a collection of parts, e.g., sylla- 
bles, which motivates a multi-instance learning (MIL) 
approach (Dietterich et al., 1997). Second, multi-label 
classification (MLC) (Tsoumakas & Katakis, 2007) is 
a natural fit for bird sound because an audio record- 
ing can be associated with a set of species (and other 
sounds) that are present. Multi-instance multi-label 
learning (MIML) combines both ideas. MIML has 
previously been used for classification of bird sound 
recordings containing multiple simultaneously vocal- 
izing species (Briggs et al., 2012c). However, prior 
work on MIML for bird sound has focussed more on 
the multi-instance structure of the sound, and less on 
structure in the species/label sets. 

The MLC framework has not been directly applied to 
bird sound (although some MIML algorithms which 
have been applied to bird sound can be considered 
a reduction to MLC, e.g., MIML-kNN (Zhang, 2010) 
and MIML-RBF (Zhang & Wang, 2009)). Ensemble 
of classifier chains (ECC) (Read et al., 2011) is an 
algorithm for MLC which has recently been applied 
to species distribution modeling, where the goal is to 
predict the set of bird species present at a site from a 
feature vector describing physical and biological properties
of the site. Yu et al. (2011) suggested
that ECC achieves better performance in this domain
than binary relevance because it can exploit correla- 
tions in the label sets. Considering this observation, 
we hypothesize that ECC can exploit the same struc- 
ture while predicting sets of bird species from an acous- 
tic feature vector instead of environmental covariates. 

We formulate the classification problem similarly to 
(Briggs et al., 2012c). The training data consists of 
audio recordings paired with a set of species that are 
present. The goal is to predict the set of species in a 
new recording which is not part of the training data. 

To apply MLC, it is necessary to represent each audio 
recording with a fixed-length feature vector. We apply
a 2D time-frequency supervised segmentation algorithm
similar to (Neal et al., 2011; Briggs et al., 2012c),
then compute the same features as in (Briggs et al., 
2012c) to describe each segment. Then we use a clus- 
tered codebook to obtain a histogram-of-segments for 
each recording. (Somervuo & Harma, 2004) used his- 
tograms to represent variable-length sequences of syl- 
lables. (Briggs et al., 2009) used histograms of frame- 
level features (spectrum and MFCC) to represent an 
audio recording with a single species of bird. 

We compare ECC, binary relevance (BR), and results 
from prior work on two real-world datasets of birdsong
with multiple simultaneously vocalizing species. 

The first dataset was collected with unattended om- 
nidirectional microphones in the H. J. A. (HJA) Ex- 
perimental Research Forest, and has previously been 
used in several classification experiments (Briggs et al., 
2012c;a;b; Liu & Dietterich, 2012).

The second dataset is new, and consists of record- 
ings of birds made with an iPhone in a residential 
neighborhood (collected and labeled by the authors). 
The new iPhone birdsong dataset presents the same 
multi-species issues as the HJA Birdsong dataset, 
but is arguably more challenging because there are 
more/louder sources of background noise and non-bird 
classes (especially motor vehicles and insects). 

Results are analyzed in terms of standard multi-label 
error measures: Hamming loss, set 0/1 loss, rank loss, 
1-error, and coverage. ECC achieves better results 
than BR in the majority of comparisons, and ECC 
with no parameter tuning is better than one and worse 
than two of the MIML algorithms (which have an un- 
fair advantage of using post-hoc parameter tuning). 
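
The five measures named above can be written down compactly. The following is a minimal Python sketch using common textbook definitions, which may differ in detail (for example in tie handling, or in whether coverage is offset by one) from the definitions used in the experiments. Labels are indexed from 0, the true label set is assumed to be non-empty, and all function names are illustrative.

```python
def hamming_loss(true, pred, c):
    """Fraction of the c labels whose presence or absence is predicted wrongly."""
    return len(true ^ pred) / c


def set_zero_one_loss(true, pred):
    """1 if the predicted label set is not exactly the true set, else 0."""
    return int(true != pred)


def ranking(scores):
    """Label indices ordered from highest to lowest score (ties keep index order)."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def one_error(true, scores):
    """1 if the top-ranked label is not in the true set, else 0."""
    return int(ranking(scores)[0] not in true)


def coverage(true, scores):
    """How far down the ranking one must go to include every true label."""
    order = ranking(scores)
    return max(order.index(j) for j in true)


def rank_loss(true, scores):
    """Fraction of (present, absent) label pairs that the scores order wrongly."""
    absent = [j for j in range(len(scores)) if j not in true]
    pairs = [(p, a) for p in true for a in absent]
    if not pairs:
        return 0.0
    return sum(scores[p] <= scores[a] for p, a in pairs) / len(pairs)
```

All five are losses, so lower values are better. Hamming loss and set 0/1 loss need a predicted label set, while the other three only need class scores.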

2. Problem Statement 

In MLC, the training dataset is $(x_1, Y_1), \ldots, (x_n, Y_n)$
where $x_i \in \mathbb{R}^d$ is a feature vector, and
$Y_i \subseteq \mathcal{Y} = \{1, \ldots, c\}$ is a subset of $c$ possible
class labels. The goal is to learn a classifier
$f(x) : \mathbb{R}^d \to 2^{\mathcal{Y}}$ which predicts a
label set from a given feature vector. It is common to
implement and evaluate multi-label classifiers based on
a score function for each class $f_j(x) : \mathbb{R}^d \to \mathbb{R}$, which
represents the predicted confidence that label $j$ is in
the set. The set predictor $f$ is defined in terms of the
score functions $f_1, \ldots, f_c$. The MLC framework maps
to acoustic species classification as follows: each audio 
recording is associated with a feature vector, and the 
set of species audible in the recording is the label set. 
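
To make the notation concrete, the sketch below (illustrative Python) converts a label set into its binary representation and turns class scores into a predicted label set. The helper names `to_bits` and `predict_set` are hypothetical, and the single threshold of 0.5 is chosen only for the example. Labels are numbered 1 to c, as in the text.

```python
def to_bits(label_set, c):
    """Binary representation of a label set over classes 1..c."""
    return [1 if j in label_set else 0 for j in range(1, c + 1)]


def predict_set(scores, threshold=0.5):
    """Predicted label set: classes whose score exceeds the threshold.

    scores[j - 1] plays the role of the score function for class j.
    """
    return {j for j, s in enumerate(scores, start=1) if s > threshold}
```

For example, a recording in which species 1 and 3 are audible out of c = 4 has the binary representation `[1, 0, 1, 0]`.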

MIML is a related framework where the training data 
consists of bags-of-instances paired with label sets, 

$(B_1, Y_1), \ldots, (B_n, Y_n)$ where $B_i = \{x_{i1}, \ldots, x_{i n_i}\}$ (1)

We will use MIML as an intermediate representation of 
audio recordings of bird sound, and solve the problem 
by a reduction from MIML to MLC. 

3. Background 

Binary relevance is one of the simplest algorithms for 
MLC. It is a reduction to SISL where binary pre- 
diction of each label is treated as a completely sep- 
arate/independent problem. To refer to a bit in the 
binary representation of a label set, let $Y_i^j = I[j \in Y_i]$.
BR creates $c$ SISL datasets $D_1, \ldots, D_c$, where
$D_j = \{(x_i, Y_i^j)\}_{i=1}^{n}$, and trains a binary SISL classifier
$f_j : \mathbb{R}^d \to \mathbb{R}$ on each $D_j$.

Classifier chains are also a reduction to SISL, but the 
problems for each class are not totally separate. CC 
predicts bits of the label set one at a time in a par- 
ticular order, and uses all of the previously predicted 
bits as features for the next bit. CC creates c SISL 
datasets $D_1, \ldots, D_c$, where

$D_j = \{(x_i \oplus Y_i^{1:j-1},\; Y_i^j)\}_{i=1}^{n}$ (2)

The notation $Y_i^{1:j-1}$ denotes the first $j-1$ bits of the
binary representation of $Y_i$, and $\oplus$ is vector concatenation.
CC trains a binary SISL classifier $f_j : \mathbb{R}^{d+j-1} \to \mathbb{R}$
on each dataset $D_j$. Algorithm 1 is pseudocode for
classification of a feature vector x with CC. Assuming 
the SISL classifier fj outputs a score or probability, a 
threshold t is used to make a 0/1 prediction. 
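
The construction of the chained training sets and the prediction loop can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: the base classifier here is a toy 1-nearest-neighbour stand-in (`train_1nn`, a hypothetical helper), feature vectors and label bits are plain lists, and classes are indexed from 0.

```python
def chain_datasets(X, Y_bits):
    """Build one dataset per class: features are x extended with the
    true bits of all earlier classes, the target is the bit of this class."""
    c = len(Y_bits[0])
    return [[(x + y[:j], y[j]) for x, y in zip(X, Y_bits)] for j in range(c)]


def train_1nn(dataset):
    """Toy base classifier (assumption): score is the label of the
    nearest training point under squared Euclidean distance."""
    def f(z):
        nearest = min(
            dataset,
            key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], z)),
        )
        return float(nearest[1])
    return f


def cc_classify(x, classifiers, t=0.5):
    """Predict bits one class at a time, feeding earlier predictions forward."""
    y = []
    for f in classifiers:
        y.append(1 if f(x + y) > t else 0)
    return y
```

At training time the chain sees the true earlier bits, while at prediction time it sees its own earlier predictions, so an early mistake can propagate down the chain.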

ECC creates an ensemble of L classifier chains, where 
each chain $l = 1, \ldots, L$ views the classes in a different
random permutation $\pi_l : \{1, \ldots, c\} \to \{1, \ldots, c\}$.
Each chain in the ensemble votes on each potential
class in the label set. For each chain $l$ and class $j$,
ECC trains a SISL classifier $f_{lj}$ on the dataset

$D_{lj} = \{(x_i \oplus Y_i^{\pi_l(1)} \oplus \cdots \oplus Y_i^{\pi_l(j-1)},\; Y_i^{\pi_l(j)})\}_{i=1}^{n}$ (3)

Algorithm 1 Classifier Chains - classify x

    Y = []
    for j = 1 to c do
        Y = Y $\oplus$ I[f_j(x $\oplus$ Y) > t]
    end for
    return Y




Algorithm 2 ECC-RF - class scores for x

    score[1, ..., c] = 0
    for l = 1 to L do
        x' = x
        for j = 1 to c do
            p_lj = f_lj(x')
            score[$\pi_l(j)$] = score[$\pi_l(j)$] + p_lj
            if j $\neq$ 1 then
                x' = x' $\oplus$ p_lj
            end if
        end for
    end for
    return score / L
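
The score aggregation over chains can be sketched in Python as follows. This is a hedged illustration, not the authors' C++ code: the per-chain classifiers are assumed to be given as plain callables, the helper names are hypothetical, and classes are indexed from 0. For simplicity the sketch appends every predicted probability to the feature vector; the value appended after the last position is simply never used.

```python
import random


def random_permutations(c, L, seed=0):
    """One random ordering of the c classes per chain."""
    rng = random.Random(seed)
    perms = []
    for _ in range(L):
        p = list(range(c))
        rng.shuffle(p)
        perms.append(p)
    return perms


def ecc_scores(x, chains, perms):
    """Average, over chains, the probability predicted for each class.

    chains[l][j] is the classifier at position j of chain l, and
    perms[l][j] is the class that this position predicts.
    """
    c = len(perms[0])
    score = [0.0] * c
    for chain, perm in zip(chains, perms):
        x_ext = list(x)
        for j, f in enumerate(chain):
            p = f(x_ext)
            score[perm[j]] += p
            x_ext.append(p)  # later positions see this prediction as a feature
    return [s / len(chains) for s in score]
```

Each class receives exactly one contribution per chain, so dividing by the number of chains gives an average probability per class.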



4. Proposed Methods 

4.1. Classifier Chains with Random Forest 

We implement ECC with a Random Forest (RF) as 
the base-SISL classifier, hence we call the proposed 
classifier ECC-RF. Because RF outputs a probability, 
the ensemble can be viewed as an instance of the Ensemble
of Probabilistic Classifier Chains (EPCC) algorithm
(Dembczynski et al., 2010). Therefore it is
reasonable to aggregate probabilities from each SISL 
classifier rather than 0/1 votes. The aggregated prob- 
abilities are used as the score-functions for each class. 
Algorithm 2 gives pseudocode we use to generate a 
class-score vector with ECC-RF, given input x. 

4.2. Out-Of-Bag Calibrated Thresholds 

Sometimes class scores are sufficient, for example to 
rank species from most likely to least likely to be 
present. However, it is often desirable to obtain a 
specific predicted label set. A label set can be ob- 
tained by comparing each score to a threshold. The 
simplest method is to use a single threshold for all 
classes (Tsoumakas & Katakis, 2007). We instead se- 
lect a separate threshold for each class, which is cali- 
brated using out-of-bag (OOB) estimation (Breiman, 
2001) (for both BR and ECC-RF). Consider one of 
the binary RFs in BR or ECC-RF, $f_j$ or $f_{lj}$. Let its
OOB estimate on instance $x_i$ in the training dataset
be $f_j(x_i, i)$ (for BR) or $f_{lj}(x_i, i)$ (for ECC). For each
class $j$, we select a threshold $t_j$ to minimize the 0/1



error on that class, comparing ground-truth labels for 
class j with OOB estimates. The threshold used in 
BR for class j is 

$t_j = \operatorname*{arg\,min}_{t \in \{.001, \ldots, .999\}} \sum_{i=1}^{n} I\big[\, I[f_j(x_i, i) > t] \neq Y_i^j \,\big]$ (4)
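
The per-class threshold search can be sketched as a simple grid scan. The Python below is a minimal illustration that assumes the out-of-bag scores for one class have already been computed by the random forest and are passed in as a list. The function name is hypothetical, and ties between equally good thresholds are broken in favour of the smallest one, which is an arbitrary choice made here.

```python
def calibrate_threshold(oob_scores, labels):
    """Pick the threshold in {0.001, ..., 0.999} with the fewest 0/1 errors.

    oob_scores[i] is the out-of-bag score for training instance i and
    labels[i] is its ground-truth bit for the class being calibrated.
    """
    grid = [k / 1000 for k in range(1, 1000)]

    def errors(t):
        # Count instances whose thresholded score disagrees with the label.
        return sum((s > t) != bool(y) for s, y in zip(oob_scores, labels))

    return min(grid, key=errors)
```

Because out-of-bag scores for an instance come only from trees that did not see that instance during training, the threshold is calibrated without a separate validation set.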

The same algorithm is applied to ECC-BR by defining 

5. Experiments 

5.1. Datasets 

Two real-world birdsong datasets are used in our ex- 
periments. 

HJA Birdsong The HJA Birdsong dataset consists 
of 548 ten-second audio recordings collected in the H. 
J. A. Experimental Research Forest, using Songmeter 
SM1 recording devices. There are 13 species in this
dataset, with between 1 and 5 species per recording 
(2.144 average). The most common sources of noise in 
this dataset include streams and wind. Further details 
of this dataset are available in (Briggs et al., 2012c). 
(Briggs et al., 2012c) used 5-fold cross-validation for 
this dataset. We use the same 5-fold partitions, so the 
results are comparable. 

iPhone Birdsong We collected 150 five-second au- 
dio recordings of bird sound with an iPhone 4G in a 
residential neighborhood. 54 of the recordings were 
collected during the dawn chorus on a single day, and 
the rest were collected at different times of day over 
several months in 2012-13. 

We filtered the original 150 recordings down to 91 
which are more suited for a cross-validated species 
classification experiment. There were 32 recordings 
with bird species we were unable to identify, and many 
more with non-bird sounds. We removed all recordings 
containing unknown bird species, amphibians, human 
voice, dogs barking, and the iPhone vibrating due to 
receiving a message. Finally, we removed all recordings
containing a species which appears only once in
the dataset (cross-validation is not reasonable in this 
case). The filtered subset of 91 recordings contains 14 
species. Many of these recordings still contain motor 
vehicle noise, loud insects, and "click noises" which 
appear as vertical lines in the spectrogram. Table 1 
lists each species, and the number of recordings it ap- 
pears in. Note that the dataset is highly unbalanced. 
Because this is a smaller dataset, we use 10-fold
cross-validation instead of 5-fold.
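
The last filtering step (dropping recordings that contain a species seen only once) can be sketched as below. This is an illustrative Python sketch, not the procedure actually used. In particular, it repeats the filter until no singleton species remain, because removing a recording can reduce another species to a single occurrence; whether the original filtering was iterated in this way is an assumption. The function name is hypothetical, and each recording is represented only by its set of species labels.

```python
from collections import Counter


def drop_singleton_species(recordings):
    """Remove recordings containing a species that occurs in only one recording.

    recordings is a list of label sets. The filter is repeated until every
    remaining species occurs at least twice (an assumption of this sketch).
    """
    kept = list(recordings)
    while True:
        counts = Counter(s for labels in kept for s in labels)
        rare = {s for s, n in counts.items() if n == 1}
        if not rare:
            return kept
        kept = [labels for labels in kept if not (labels & rare)]
```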
Table 1. The number of recordings containing each species
in the iPhone Birdsong dataset.

Species                      Recordings
American Goldfinch           2
American Robin               23
Black Capped Chickadee       36
Black-headed Grosbeak        2
Chestnut Backed Chickadee    3
Golden Crowned Kinglet       6
Great Horned Owl             2
Killdeer                     7
Marsh Wren                   3
Northern Flicker             4
Red Breasted Nuthatch        19
Red-Winged Blackbird         23
Spotted Towhee               13
Steller's Jay                4

5.2. Histogram-of-Segments Representation 

In order to apply MLC, we represent each audio file 
with a fixed-length feature vector. Prior work (Briggs 
et al., 2012c) has shown that 2D time-frequency segmentation
of a spectrogram is useful for separating
bird sounds which may overlap in time. For the new 
iPhone Birdsong dataset, we follow a similar process to 
(Briggs et al., 2012c) for supervised 2D segmentation 
of spectrograms.^1

Each segment is isolated, and described by the same 
38 acoustic features as in (Briggs et al., 2012c). At 
this point, the audio dataset is represented as a MIML 
dataset (each recording is a bag of segments paired 
with a set of species). We reduce this MIML dataset 
to an MLC dataset by summarizing all of the segments 
in a recording with a histogram. Hence, the feature 
vector used for MLC has dimension k, where k is the
number of clusters. For the HJA Birdsong dataset,
we use the original segmentation and segment features
from (Briggs et al., 2012c), rather than our slightly
modified segmentation.

Segment features are clustered using k-means++
(Arthur & Vassilvitskii, 2007) to form a codebook. For 
each recording, each of its segments is mapped to a 
cluster center, and the normalized count of segments 
for each cluster is used as the histogram-of-segments 
feature. Figure 1 shows some example clusters from 
the codebook for the iPhone dataset. 
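
The mapping from a bag of segments to a fixed-length vector can be sketched as follows. This Python illustration assumes the codebook (the cluster centers) has already been learned, for example by k-means++, and is passed in as a list of vectors. The function names are hypothetical, and returning an all-zero histogram for a recording with no segments is a choice made here, not something specified in the text.

```python
def nearest_center(segment, centers):
    """Index of the cluster center closest to a segment feature vector."""
    return min(
        range(len(centers)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(segment, centers[k])),
    )


def histogram_of_segments(segments, centers):
    """Normalized count of a recording's segments per codebook cluster."""
    counts = [0] * len(centers)
    for seg in segments:
        counts[nearest_center(seg, centers)] += 1
    total = len(segments)
    if total == 0:
        return [0.0] * len(centers)  # no segments: all-zero histogram
    return [n / total for n in counts]
```

The output has one entry per cluster regardless of how many segments a recording contains, which is what allows a standard MLC algorithm to be applied.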



^1 There are some minor differences in segmentation in
the iPhone dataset vs. the HJA dataset. For the iPhone
dataset, the RF used for segmentation was trained on
features consisting of pixels in a 17x17 window, the
y-coordinate of the window center, and the average intensity
in the window. This RF used 100 trees with a maximum 
depth of 10. We annotated 20 out of 91 of the spectrograms 
in the dataset with examples of correct segmentation. 
Figure 1. Example clusters of segments in the codebook
used in the construction of histogram-of-segment features 
for the iPhone dataset (modified to enhance contrast). 



5.3. Comparison to MIML 

Using results from (Briggs et al., 2012c) on the HJA 
dataset, we compare our proposed ECC-RF algorithm 
to three MIML algorithms: MIMLSVM, MIML-kNN
and MIMLRBF. Each of these algorithms is a reduction
from MIML to MLC; it constructs a single fixed-length
feature vector from a bag of instances (i.e.,
a recording containing a varying number of segments),
then applies binary relevance. For BR, MIMLSVM uses
SVM as the base-SISL classifier, while MIML-kNN and
MIMLRBF use linear models trained by unregularized
min-squared-error. These MIML classifiers focus
mainly on construction of a good "summary" feature 
vector, while using only the simplest MLC classifier. In 
contrast, our proposed method uses a simpler feature 
vector construction, and a more complicated model of 
structure in the label sets. 



5.4. Parameters

For constructing histogram-of-segments features, the
parameter to k-means++ is k = 50.

The only parameters for ECC-RF are L, the number 
of chains, and T, the number of trees in each RF. It is 
expected that as these parameters are increased, the 
accuracy of the classifier converges to some asymptotic 
value. Hence selection of these parameters is mainly 
a matter of how much computation time is available. 
We conservatively chose L = 25, T = 25, and did no
further optimization of these parameters.^2



^2 Running 10 repetitions of 5- or 10-fold CV on both
datasets with BR and ECC-RF takes 424 seconds on a Mac
Pro with 2x2.4 GHz Quad-Core Intel Xeon Processors. The
RF tree induction is parallel and the rest is sequential. The
implementation is in C++ compiled with GCC 4.2.

For BR, the only parameter is T, the number of trees
in each RF. We set $T = 25^2 = 625$ for BR to ensure that
the total number of trees which cast a vote in every
prediction is the same between BR and ECC-RF. All
decision trees used in both BR and ECC-RF use a maximum
tree depth of 15, and store histograms of class
labels in decision tree leaves instead of the majority
label.

The three MIML algorithms that we compare to in
this experiment have parameters which must be tuned
(e.g., by grid search). Unlike the parameters of ECC-RF,
these cannot be set simply by increasing them until
accuracy converges. Although such parameters can be
optimized by cross-validation (with respect to a particular
multi-label performance measure), doing so adds an order
of magnitude to the runtime of the classification
experiment, so (Briggs et al., 2012c) used "post-hoc"
parameter selection. In post-hoc selection, the experiment
is run multiple times for all combinations of parameter
values in a grid, and the best result over all parameter
settings is reported. Therefore the MIML algorithms
have an advantage in these experiments.

5.5. Results

Table 2 lists results. Because RF and ECC are randomized,
we run 10 trials, and report results averaged
over all trials and folds of cross-validation.

Following recommendations in (Demsar, 2006), we
summarize results for multiple classifiers on multiple
datasets by win-loss counts (and do not discard any result
as "insignificant"). However, unlike the scenario
considered by (Demsar, 2006), we compare MLC classifiers
rather than SISL classifiers, so there are multiple
performance measures. Because there are only a few
datasets and more performance measures, we aggregate
win/loss counts over all measures.

Comparing BR and ECC-RF on two datasets with five
different performance measures gives 10 comparisons
between the two algorithms. Over both datasets, the
win-loss count for ECC-RF vs. BR is 7-3. On the
iPhone dataset, the result is less decisive; the count for
ECC-RF vs. BR is 3-2. On the HJA Birdsong dataset,
the count for ECC-RF vs. BR is 4-1. Overall, these
results suggest there is an advantage to using ECC-RF
over BR for multi-label classification of bird species
sets, given the histogram-of-segments representation.

Next we consider the win-loss counts on the HJA Birdsong
dataset for ECC-RF vs. MIMLSVM, MIMLRBF,
and MIML-kNN. The counts are 5-0, 1-4, and 0-5,
respectively, i.e., MIMLSVM is worse than ECC-RF in
all comparisons, MIMLRBF is better than ECC-RF in
four of five comparisons, and MIML-kNN is better in
all five. However, this is not an entirely
fair comparison due to post-hoc parameter selection in
the MIML experiments.
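
The win-loss bookkeeping is simple enough to check by hand, but a short sketch makes the aggregation explicit. The Python below is illustrative only. The `win_loss` helper is hypothetical and assumes every measure is a loss (lower is better), with ties counted for neither side. The per-dataset counts are the ones quoted in the text for ECC-RF vs. BR.

```python
def win_loss(a_values, b_values):
    """Count the measures on which A beats B and on which B beats A.

    Both inputs hold one value per performance measure; lower is better.
    """
    wins = sum(a < b for a, b in zip(a_values, b_values))
    losses = sum(a > b for a, b in zip(a_values, b_values))
    return wins, losses


# Per-dataset win-loss counts for ECC-RF vs. BR, as reported in the text.
reported = {"iPhone": (3, 2), "HJA": (4, 1)}

# Aggregating over datasets sums wins and losses separately.
total = tuple(sum(column) for column in zip(*reported.values()))
```

Summing the two per-dataset counts reproduces the overall 7-3 figure across the 10 comparisons.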

6. Discussion 

We suggest that the performance advantage of MIML-RBF
and MIML-kNN over ECC-RF may be attributed
to better representation of the multi-instance structure 
in the data (compared to our histogram-of-segments 
representation). Based on comparisons between ECC 
and BR, better modeling of structure in the label set is 
beneficial when compared with the same features and 
base-SISL classifier. 

7. Related Work 

We focussed on learning to predict species label sets. 
Another interesting problem is to train on recordings 
with multiple labels, but classify segments with a sin- 
gle label. Such an approach reduces the labeling ef- 
fort required to train SISL segment/syllable classifiers 
such as (Fagerlund, 2007; Damoulas et al., 2010). This 
problem is naturally formulated in the framework of 
MIML instance annotation (Briggs et al., 2012a;b). A 
related formulation is to associate each segment with 
a set of candidate labels, only one of which is correct. 
This formulation is called ambiguous label classifica- 
tion (Cour et al., 2011), or superset label learning (Liu 
& Dietterich, 2012).